3.5 MCMC

1 MCMC

Hierarchical models can become very complex very quickly, creating serious computational headaches. For the posterior $\lambda(\theta \mid x) = \dfrac{p_\theta(x)\lambda(\theta)}{\int_\Omega p_\zeta(x)\lambda(\zeta)\,d\zeta}$, the normalizing integral is often intractable.
A computational strategy is to set up a Markov chain whose stationary distribution is proportional to $p_\theta(x)\lambda(\theta)$, and run it to obtain approximate samples from $\lambda(\theta \mid x)$.

Compared with the treatment of Markov chains in probability courses, the notation here is more Bayesian.

Markov Chain

A (stationary) Markov chain with transition kernel $Q(y \mid x)$ and initial distribution $\pi_0(x)$ is a sequence of RVs $X^{(0)}, X^{(1)}, \dots$ where $X^{(0)} \sim \pi_0$ and $X^{(t+1)} \mid X^{(0)}, \dots, X^{(t)} \sim Q(\cdot \mid X^{(t)})$, i.e. $Q(y \mid x) = P(X^{(t+1)} = y \mid X^{(t)} = x)$.

The marginal distribution of $X^{(1)}$ is $\pi_1(y) = P(X^{(1)} = y) = \int_{\mathcal{X}} Q(y \mid x)\,\pi_0(x)\,d\mu(x)$.
This is a directed graphical model: $X^{(0)} \to X^{(1)} \to X^{(2)} \to \cdots$
If $\pi(y) = \int_{\mathcal{X}} Q(y \mid x)\,\pi(x)\,d\mu(x)$, we say $\pi$ is a stationary distribution for $Q$.
A sufficient condition for stationarity is detailed balance: $\pi(x)Q(y \mid x) = \pi(y)Q(x \mid y)$ for all $x, y$.

This suffices because integrating both sides of detailed balance over $x$ gives $\pi(y) = \pi(y)\int_{\mathcal{X}} Q(x \mid y)\,d\mu(x) = \int_{\mathcal{X}} Q(y \mid x)\,\pi(x)\,d\mu(x)$.

A Markov chain satisfying detailed balance is called reversible: if $\pi_0 = \pi$, then $(X^{(0)}, \dots, X^{(t)}) \overset{d}{=} (X^{(t)}, \dots, X^{(0)})$.

Note that $P(X^{(t)} = x \mid X^{(t+1)} = y) = \dfrac{P(X^{(t)} = x)\,P(X^{(t+1)} = y \mid X^{(t)} = x)}{P(X^{(t+1)} = y)} = \dfrac{\pi(x)Q(y \mid x)}{\pi(y)} = Q(x \mid y)$.
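As a sanity check, detailed balance and the resulting stationarity can be verified numerically on a small discrete chain. The Metropolis-style construction of $Q$ below is an illustrative assumption, not something from the notes; it is just a convenient way to build a kernel with detailed balance for a chosen $\pi$:

```python
import numpy as np

# Target distribution pi on 3 states; build Q[x, y] = Q(y | x) so that
# detailed balance holds (Metropolis recipe: propose uniformly, accept
# with probability min(1, pi(y)/pi(x))).
pi = np.array([0.2, 0.3, 0.5])
Q = np.zeros((3, 3))
for x in range(3):
    for y in range(3):
        if x != y:
            Q[x, y] = (1 / 3) * min(1.0, pi[y] / pi[x])
    Q[x, x] = 1.0 - Q[x].sum()  # leftover probability of staying put

# Detailed balance: pi(x) Q(y|x) = pi(y) Q(x|y), i.e. the matrix
# pi(x) Q(y|x) is symmetric in (x, y).
flow = pi[:, None] * Q
assert np.allclose(flow, flow.T)

# Detailed balance implies stationarity: sum_x pi(x) Q(y|x) = pi(y).
assert np.allclose(pi @ Q, pi)
```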

Theorem

If a Markov chain with stationary distribution $\pi$ is

  1. Irreducible: $\forall x, y\ \exists n: P(X^{(n)} = y \mid X^{(0)} = x) > 0$.
  2. Aperiodic: $\forall x,\ \gcd\{n > 0 : P(X^{(n)} = x \mid X^{(0)} = x) > 0\} = 1$,

then $\mathcal{L}(X^{(t)}) \to \pi$ regardless of $\pi_0$.

The proof is beyond the scope of these notes.
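The theorem can be illustrated numerically by iterating the marginal update $\pi_{t+1}(y) = \sum_x Q(y \mid x)\,\pi_t(x)$ from two different starting distributions. The particular transition matrix below is made up for illustration (any matrix with all-positive entries is irreducible and aperiodic):

```python
import numpy as np

# An irreducible, aperiodic chain (all entries positive); rows sum to 1.
Q = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# Two very different initial distributions pi_0.
pi_a = np.array([1.0, 0.0, 0.0])
pi_b = np.array([0.0, 0.0, 1.0])

# Iterate pi_{t+1} = pi_t Q (row vector times transition matrix).
for _ in range(200):
    pi_a = pi_a @ Q
    pi_b = pi_b @ Q

# Both marginals converge to the same stationary distribution.
assert np.allclose(pi_a, pi_b)
assert np.allclose(pi_a @ Q, pi_a)
```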

So the strategy is: find $Q$ with stationary distribution $\lambda(\theta \mid X)$, start at any $\theta^{(0)}$, and run the chain for a long time; then $\theta^{(t)}$ is approximately a sample from the posterior for large $t$.

2 Gibbs Sampler

Denote the parameter vector $\theta = (\theta_1, \dots, \theta_d)$.

Gibbs Sampler

  • Initialize $\theta = \theta^{(0)}$.
  • For $t = 1, \dots, T$:
    • For $j = 1, \dots, d$:
      • Sample $\theta_j \sim \lambda(\theta_j \mid \theta_{-j}, x)$. (*)
    • Record $\theta^{(t)} = \theta$.

Here $\theta_{-j} = (\theta_1, \dots, \theta_{j-1}, \theta_{j+1}, \dots, \theta_d)$ denotes all parameters apart from $\theta_j$.
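A minimal sketch of the sampler for a toy bivariate normal posterior, where both full conditionals are available in closed form. The target distribution and its correlation $\rho$ are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "posterior": (theta_1, theta_2) ~ N(0, [[1, rho], [rho, 1]]).
# The full conditionals are theta_j | theta_{-j} ~ N(rho * theta_{-j}, 1 - rho^2).
rho = 0.5
T = 20000
theta = np.zeros(2)          # initialize theta = theta^(0)
samples = np.empty((T, 2))
for t in range(T):
    for j in range(2):       # systematic scan over the coordinates
        other = theta[1 - j]
        theta[j] = rng.normal(rho * other, np.sqrt(1 - rho**2))
    samples[t] = theta       # record theta^(t)

# Sample moments should be close to the target's (mean 0, correlation rho).
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])
```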

Variations on (*): e.g. sweeping the coordinates in random rather than fixed order (random scan), updating a block of coordinates jointly (blocked Gibbs), or replacing an intractable conditional draw with a Metropolis step (Metropolis-within-Gibbs).

Advantage for hierarchical priors: by the Markov structure of the graphical model, each full conditional $\lambda(\theta_j \mid \theta_{-j}, x)$ depends only on the neighbors of $\theta_j$, and with conditionally conjugate priors it is often available in closed form.

2.1 Stationarity of $\lambda(\theta \mid X)$

Claim

If $\theta^{(t)} \sim \lambda(\theta \mid X)$, then $\theta^{(t+1)} \sim \lambda(\theta \mid X)$. Indeed, each step (*) leaves the posterior invariant: if $(\theta_j, \theta_{-j})$ is jointly distributed according to the posterior, then resampling $\theta_j$ from its exact conditional $\lambda(\theta_j \mid \theta_{-j}, x)$ leaves the joint distribution unchanged.

3 MCMC in Practice

(Figure: trace plots of Gibbs output under different scenarios; the last shows a well-mixed chain.)
For the last (desirable) one, the posterior mean can be estimated by $\frac{1}{N+1}\sum_{k=0}^{N} \theta_j^{(B + ks)} \approx E[\theta_j \mid X]$, with burn-in $B$ and thinning stride $s$.

But this will cause Gibbs to take a long time to mix. (The figure shows the posterior concentrated on a very sharp ellipse, i.e. $\theta_1$ and $\theta_2$ are highly correlated.) A better parameterization is $\beta_1 = \theta_1 + \theta_2$, $\beta_2 = \theta_1 - \theta_2$, so that $\beta_1 \perp \beta_2 \mid X$. In these coordinates the Gibbs sampler is equivalent to sampling directly from the posterior.
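The mixing problem can be seen numerically. The sketch below (with an assumed bivariate normal target, as before) compares the lag-1 autocorrelation of a Gibbs trace on a nearly degenerate ellipse ($\rho = 0.99$) against the decorrelated parameterization ($\rho = 0$, where Gibbs draws are effectively i.i.d. from the posterior):

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_autocorr(rho, T=5000):
    """Lag-1 autocorrelation of the first coordinate's Gibbs trace
    for the target N(0, [[1, rho], [rho, 1]])."""
    theta = np.zeros(2)
    trace = np.empty(T)
    for t in range(T):
        for j in range(2):
            theta[j] = rng.normal(rho * theta[1 - j], np.sqrt(1 - rho**2))
        trace[t] = theta[0]
    return np.corrcoef(trace[:-1], trace[1:])[0, 1]

# Sharp ellipse: consecutive samples are nearly identical (slow mixing).
# After the rotation beta_1 = theta_1 + theta_2, beta_2 = theta_1 - theta_2,
# the coordinates are independent, and Gibbs mixes immediately.
slow = gibbs_autocorr(0.99)
fast = gibbs_autocorr(0.0)
print(slow, fast)  # slow is close to 1, fast is close to 0
```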

4 Empirical Bayes

Back to the Gaussian hierarchical model: $\frac{1}{d}\|X\|^2 \sim \frac{1+\tau^2}{d}\chi^2_d$, which has mean $1+\tau^2$ and variance $\frac{2(1+\tau^2)^2}{d}$, so the MLE for $1+\tau^2$ is $\frac{1}{d}\|X\|^2$.
For any "reasonable" prior, $E[\zeta \mid X] \approx \frac{d}{\|X\|^2}$ (where $\zeta = \frac{1}{1+\tau^2}$), which is close to the MLE. Hence $\hat\theta_i \approx \left(1 - \frac{d}{\|X\|^2}\right)X_i \approx (1-\zeta)X_i$.

If the prior doesn't matter much, why use one? We could just estimate $\zeta$ from the data however we like and "plug it in".

UMVUE: $\hat\zeta = \dfrac{d-2}{\|X\|^2}$.

We call this hybrid approach Empirical Bayes: the hyperparameters are treated as fixed and estimated from the data, while the remaining parameters are treated as random.
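A sketch of the plug-in recipe in the Gaussian hierarchical model above (the dimension and $\tau^2$ below are arbitrary illustrative choices): estimate $\zeta$ by its UMVUE and shrink, which yields the James-Stein estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian hierarchical model: theta_i ~ N(0, tau^2), X_i | theta_i ~ N(theta_i, 1).
d, tau2 = 500, 2.0
theta = rng.normal(0.0, np.sqrt(tau2), size=d)
X = rng.normal(theta, 1.0)

# Empirical Bayes plug-in: zeta_hat = (d - 2) / ||X||^2 (the UMVUE),
# then shrink toward 0: theta_hat = (1 - zeta_hat) X.
zeta_hat = (d - 2) / np.sum(X**2)
theta_hat = (1 - zeta_hat) * X

# The shrinkage estimator beats the MLE theta_hat = X in total squared error.
print(np.sum((theta_hat - theta)**2), np.sum((X - theta)**2))
```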